Abstract

An estimate of the probability density function of a random vector is obtained by 
maximizing the mutual information between the input and the output of a feedforward 
network of sigmoidal units with respect to the input weights. Classification problems 
can be solved by selecting the class associated with the maximal estimated density. 
Newton’s method, applied to an estimated density, yields a recursive maximum likeli- 
hood estimator, consisting of a single internal layer of sigmoids, for a random variable 
or a random sequence. Applications to the diamond classification and to the prediction 
of a sun-spot process are demonstrated. 
1 Introduction


Neural networks are being applied to a wide variety of pattern recognition and signal
processing problems. Statistical and information-theoretic methods are playing an increasing
role in the design and analysis of such networks. The representation of probability density
functions by neural networks (e.g., [1] - [6]) has been of particular interest. Employing
well-established statistical performance criteria in neural network design leads not only to
the development of new tools for problems that have traditionally been solved by linear
regression methods, but also to a more profound understanding and a more efficient
application of neural networks.

In this work we first employ the maximum mutual information criterion in deriving the
parameters of a feedforward sigmoidal network which produces an estimate of the probability
density function (pdf) of a random vector. This criterion has been used recently in developing
learning ([8, 9]) and feature selection [10] methods. The estimated pdf obtained for each
class can be used as a comparative measure in solving classification problems. Then we
derive a recursive maximum likelihood estimator for a random variable, given a random
vector. This estimator employs the parameters calculated by the pdf estimator, and can be
used in an adaptive mode. Application to the prediction of random sequences is immediate.
Employing a particular sigmoidal nonlinearity ($\tanh(W_i x + t_i)$) produces explicit expressions
for the parameters of the resulting algorithms. Applications to real classification and process
prediction problems are described.


2 Mutual Information and PDF Estimation by Sigmoids


Let $x \in \mathbb{R}^n$ and $y \in \mathbb{R}^n$ be random vectors, having probability density functions $p_x(x)$ and
$p_y(y)$, respectively. The mutual information between x and y is defined as [12]

$$I(x, y) = h(x) + h(y) - h(x, y)$$

where

$$h(x) = -E_x\{\log p_x(x)\}$$

with $E_x\{\cdot\}$ denoting expectation with respect to $p_x(x)$, is the entropy of x. Put in the form

$$I(x, y) = h(x) - h(x \mid y)$$

the mutual information between x and y is known to be the “information about x contained
in y” (symmetrically, of course, it is the “information about y contained in x”) [12]. If y is
to be used in making inferences about x, it is desirable to maximize the mutual information
between them.
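As a small numerical illustration of the definition above, the following sketch evaluates $I(x, y) = h(x) + h(y) - h(x, y)$ for a pair of jointly Gaussian scalars and compares it with the known closed form $-\frac{1}{2}\log(1 - \rho^2)$. The Gaussian setting, the function names and the chosen variances are assumptions made only for this example; they are not part of the method described in the text.

```python
import math


def gaussian_entropy(variance):
    """Differential entropy (in nats) of a scalar Gaussian with the given variance."""
    return 0.5 * math.log(2.0 * math.pi * math.e * variance)


def gaussian_joint_entropy(var_x, var_y, rho):
    """Differential entropy (in nats) of a bivariate Gaussian with correlation rho."""
    det_cov = var_x * var_y * (1.0 - rho ** 2)
    return 0.5 * math.log((2.0 * math.pi * math.e) ** 2 * det_cov)


def gaussian_mutual_information(var_x, var_y, rho):
    """I(x, y) = h(x) + h(y) - h(x, y) for a bivariate Gaussian pair."""
    return (gaussian_entropy(var_x) + gaussian_entropy(var_y)
            - gaussian_joint_entropy(var_x, var_y, rho))


# Illustrative values only: variances 1 and 4, correlation 0.8.
mi = gaussian_mutual_information(1.0, 4.0, 0.8)
closed_form = -0.5 * math.log(1.0 - 0.8 ** 2)
```

In this example the mutual information depends only on the correlation, not on the individual variances, and it vanishes when the two variables are uncorrelated.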
Let the i’th component of the vector y be

$$y_i = \frac{1}{\det W}\, G_i(u_i) \qquad (1)$$


where $W \in \mathbb{R}^{n \times n}$ is a real nonsingular matrix, $G_i(\cdot)$ is a monotone increasing, continuous
bounded function, a “sigmoid”, and

$$u_i = W_i^T x + t_i \qquad (2)$$

where $W_i$ is the i’th row of W and $t_i$ is a scalar “threshold”. In vector form

$$y = \frac{1}{\det W}\, G(u) \qquad (3)$$

where $y = [y_1, \ldots, y_n]^T$, $u = [u_1, \ldots, u_n]^T$, and $G(u) = [G_1(u_1), \ldots, G_n(u_n)]^T$.
The probability density function of y satisfies [11]

$$p_y(y) = \left. \frac{p_x(x)}{|\det J(x)|} \right|_{x = G^{-1}(y)} \qquad (4)$$

where $\det J(x)$ is the determinant of the Jacobian of $y = G(x)$, whose i,j’th component is

$$J_{ij} = \frac{\partial y_i(x)}{\partial x_j}$$

Clearly, $J(x)$ is a square matrix, and, for sigmoidal $G_i(x)$, the vector $x = G^{-1}(y)$, whose
components are $G_i^{-1}(y)$, is well defined. It follows that

$$I(x, y) = h(y) = h(x) + E\{\log |\det J(x)|\}$$

Since $E\{\log p_x(x)\}$ does not depend on W or t, the maximum mutual information criterion
becomes

$$\max_{W, t} \; E\{\log |\det J(x)|\} \qquad (5)$$


How does the maximum mutual information criterion apply to the estimation of the pdf 
of a random vector? First note that 

$$-I(x, y) = E\{\log p_x(x)\} - E\{\log |\det J(x)|\} = D(p_x(x), |\det J(x)|)$$

where $D(p_x(x), |\det J(x)|)$ is the divergence between $p_x(x)$ and $|\det J(x)|$. Furthermore,
since $y \in [0, 1]^n$, we have $h(y) \le 0$, as the maximal entropy of any density on $[0, 1]^n$ is not
greater than 0, the entropy of a uniform density [12]. It follows that $D(p_x(x), |\det J(x)|) \ge 0$,
hence, maximizing the mutual information between x and y is equivalent to minimizing
the divergence between $p_x(x)$ and $|\det J(x)|$. Noting that

$$\frac{\partial y_i(x)}{\partial x_j} = \sum_k \frac{\partial y_i(x)}{\partial u_k} \frac{\partial u_k(x)}{\partial x_j}$$

it follows that

$$\det(J) = \prod_{i=1}^{n} g(u_i)$$

where

$$g(u_i) = dG(u_i)/du_i$$

Since $G(u_i)$ is monotone increasing in $u_i$, it follows that $g(u_i) > 0$, hence

$$|\det(J)| = \det(J)$$

The proposed pdf estimate is then

$$\hat{p}_x(x) = \det J(x)$$
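To make the product form of the estimate concrete, here is a minimal sketch that evaluates $\prod_i g(u_i)$ with $u_i = W_i^T x + t_i$ at a given point. It assumes the tanh-based sigmoid $G(u) = 0.5[1 + \tanh(u)]$, so that $g(u) = 0.5/\cosh^2(u)$. The function names, the one-dimensional check with unit weight and the integration grid are hypothetical choices for illustration only.

```python
import math


def g(u):
    """Derivative of the assumed sigmoid G(u) = 0.5 * (1 + tanh(u))."""
    return 0.5 / math.cosh(u) ** 2


def pdf_estimate(x, W, t):
    """Evaluate prod_i g(u_i) with u_i = W_i . x + t_i (W is a list of rows)."""
    value = 1.0
    for row, t_i in zip(W, t):
        u_i = sum(w * x_j for w, x_j in zip(row, x)) + t_i
        value *= g(u_i)
    return value


def integrate_1d(W, t, lo=-20.0, hi=20.0, steps=40000):
    """Midpoint-rule integral of the one-dimensional estimate over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x_mid = lo + (k + 0.5) * h
        total += pdf_estimate([x_mid], W, t) * h
    return total


# One-dimensional sanity check with unit weight and an arbitrary threshold.
area = integrate_1d([[1.0]], [0.3])
```

In this one-dimensional case with unit weight the estimate is a bell-shaped curve centred at $x = -t$ whose area is numerically close to one.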


3 Parameter Adaptation Algorithm 


To find the optimal parameters, we need to maximize 

$$S = E\{\log \det J(x)\} = E\left\{ \sum_{i=1}^{n} \log g(W_i^T x + t_i) \right\}$$

The gradients of the latter with respect to W and $t = (t_1, \ldots, t_n)^T$ are found to be

$$\nabla_W S = E\{B(u)\, x^T\}$$

and

$$\nabla_t S = E\{B(u)\}$$

where

$$B(u) = [b(u_1), \ldots, b(u_n)]^T$$

with

$$b(u_i) = \frac{d}{du_i} \log g(u_i) = \frac{dg(u_i)/du_i}{g(u_i)}$$

We use, in particular, the function

$$G_i(u_i) = 0.5[1 + \tanh(u_i)]$$

for which

$$g(u_i) = \frac{0.5}{\cosh^2(u_i)}$$

hence

$$b(u_i) = -2\tanh(u_i)$$
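The closed forms for $g$ and $b$ can be checked numerically. The sketch below compares them with central finite differences of $G$ and of $\log g$ at a few points. The helper names, the test points and the step size are arbitrary choices for this illustration.

```python
import math


def G(u):
    """Sigmoid G(u) = 0.5 * (1 + tanh(u))."""
    return 0.5 * (1.0 + math.tanh(u))


def g(u):
    """Claimed derivative of G: 0.5 / cosh(u)^2."""
    return 0.5 / math.cosh(u) ** 2


def b(u):
    """Claimed derivative of log g: -2 * tanh(u)."""
    return -2.0 * math.tanh(u)


def central_difference(f, u, eps=1e-6):
    """Second-order finite-difference approximation of f'(u)."""
    return (f(u + eps) - f(u - eps)) / (2.0 * eps)


points = (-1.5, 0.0, 0.7, 2.0)
# Largest mismatch between the finite differences and the closed forms.
max_err_g = max(abs(central_difference(G, u) - g(u)) for u in points)
max_err_b = max(abs(central_difference(lambda v: math.log(g(v)), u) - b(u))
                for u in points)
```

Both mismatches stay at the level of finite-difference error, which supports the expressions $g(u) = 0.5/\cosh^2(u)$ and $b(u) = -2\tanh(u)$.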

Iterative algorithms of the form 

$$W(k + 1) = W(k) + \mu(k)\, \nabla S(k)$$

where $\mu(k)$ is a step size control parameter and $\nabla S(k)$ is an empirical version of $\nabla S$, can
be used in searching for the optimal parameters. One possibility is replacing the expectations
in (9) and (10) by empirical averages over the input samples. Another is replacing
them by the samples themselves. For instance, $E\{B[W(k)x + t]\, x^T\}$ would be replaced
by $\frac{1}{K} \sum_{l=1}^{K} B[W(k)x^{(l)} + t]\, x^{(l)T}$, where $x^{(l)}$ is the l’th of K training input vectors, or simply by
$B[W(k)x^{(k)} + t]\, x^{(k)T}$. Using the inverse of the Hessian of S with respect to the parameters
as the step size control parameter is likely to speed up the convergence rate of the algorithm
[13], although its computation may require considerable time for high dimensional inputs.
The Hessian in our case is a four dimensional tensor. Updating the columns of W one at a
time, the Hessian for the m’th column at the k’th iteration is a matrix whose i,j’th element
is


$$\left[ \nabla^2 S(k)^{(m)} \right]_{i,j} = -2\, E\left\{ \frac{x_m x_j}{\cosh^2(u_i(k))} \right\} \qquad (12)$$
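The first-order update can be sketched as follows, using empirical averages over a batch of samples in place of the expectations and a constant step size instead of the inverse Hessian. This is only an illustration under those assumptions: the names `objective` and `gradient_step`, the synthetic Gaussian samples, the initial weights and the step value are hypothetical and are not taken from the text.

```python
import math
import random


def log_g(u):
    """log g(u) for g(u) = 0.5 / cosh(u)^2."""
    return math.log(0.5) - 2.0 * math.log(math.cosh(u))


def unit_inputs(x, W, t):
    """u_i = W_i . x + t_i for every row W_i of W."""
    return [sum(w * x_j for w, x_j in zip(row, x)) + t_i for row, t_i in zip(W, t)]


def objective(samples, W, t):
    """Empirical S: sample average of sum_i log g(u_i)."""
    return sum(sum(log_g(u_i) for u_i in unit_inputs(x, W, t))
               for x in samples) / len(samples)


def gradient_step(samples, W, t, step):
    """One ascent step using sample averages of B(u) x^T and B(u)."""
    n, dim, count = len(W), len(W[0]), len(samples)
    grad_W = [[0.0] * dim for _ in range(n)]
    grad_t = [0.0] * n
    for x in samples:
        for i, u_i in enumerate(unit_inputs(x, W, t)):
            b_i = -2.0 * math.tanh(u_i)  # b(u_i) for the tanh-based sigmoid
            grad_t[i] += b_i / count
            for j in range(dim):
                grad_W[i][j] += b_i * x[j] / count
    new_W = [[W[i][j] + step * grad_W[i][j] for j in range(dim)] for i in range(n)]
    new_t = [t[i] + step * grad_t[i] for i in range(n)]
    return new_W, new_t


# Synthetic two-dimensional samples, used only to exercise the update.
rng = random.Random(0)
samples = [[rng.gauss(1.0, 0.5), rng.gauss(-0.5, 1.0)] for _ in range(200)]
W, t = [[1.0, 0.2], [-0.3, 0.8]], [0.0, 0.0]
before = objective(samples, W, t)
W, t = gradient_step(samples, W, t, step=0.05)
after = objective(samples, W, t)
```

With a sufficiently small step the empirical objective increases from `before` to `after`. Replacing the batch average by a single sample per iteration gives the sample-by-sample variant mentioned in the text.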


An immediate application of the pdf estimate is in classification problems. In training, 
the pdf corresponding to each of the classes is learnt by a different set of parameters. In 
operation, the class corresponding to the largest pdf is selected. Another application is in 
estimating random variables and random sequences. This is discussed next. 
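The classification rule can be sketched as an arg-max over per-class log-density estimates. The class labels and the hand-picked parameters below are hypothetical and stand in for parameters that would be learnt separately for each class. The estimate uses the tanh-based $g(u) = 0.5/\cosh^2(u)$.

```python
import math


def log_pdf_estimate(x, W, t):
    """log of prod_i g(W_i . x + t_i), with g(u) = 0.5 / cosh(u)^2."""
    total = 0.0
    for row, t_i in zip(W, t):
        u_i = sum(w * x_j for w, x_j in zip(row, x)) + t_i
        total += math.log(0.5) - 2.0 * math.log(math.cosh(u_i))
    return total


def classify(x, class_params):
    """Return the label whose estimated log-density at x is largest."""
    return max(class_params,
               key=lambda label: log_pdf_estimate(x, *class_params[label]))


# Hypothetical per-class parameters (W, t); with identity weights the
# estimate for each class peaks at x = -t.
class_params = {
    "A": ([[1.0, 0.0], [0.0, 1.0]], [-2.0, -2.0]),  # peak near (2, 2)
    "B": ([[1.0, 0.0], [0.0, 1.0]], [2.0, 2.0]),    # peak near (-2, -2)
}
label_near_a = classify([1.8, 2.1], class_params)
label_near_b = classify([-2.2, -1.7], class_params)
```

Comparing log-densities rather than densities gives the same decision and avoids underflow when the product has many factors.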


4 Estimating Random Variables and Sequences 


Let x be a random variable, having a probability density function (pdf) $p_x(x)$, let y be a
vector of n random variables having a joint pdf $p_y(y)$, and let the joint pdf of x and y be
denoted $p_{x,y}(x, y)$. The maximum likelihood estimate of x given y is obtained by maximizing
the corresponding conditional pdf

$$p(x \mid y) = \frac{p_{x,y}(x, y)}{p_y(y)}$$

with respect to x, which is the same as maximizing $p_{x,y}(x, y)$ with respect to x.

In real problems, $p_{x,y}(x, y)$ is not available and must be estimated from the data.
Defining

$$\bar{x} = (y^T, x)^T$$

as the vector obtained by concatenating y and x, the proposed pdf estimate of $\bar{x}$ is

$$\hat{p}_{\bar{x}}(\bar{x}) = \prod_{i=1}^{N} g(W_i \bar{x} + t_i)$$

where $W_i$ is the i’th row of the weights matrix obtained in estimating $p_{\bar{x}}(\bar{x})$ by the algorithm
described in the previous section.

Maximization of $p_{x,y}(x, y)$ is equivalent to maximization of $\log p_{x,y}(x, y)$, which is, in
turn, equivalent to the maximization of

$$f(x) = \sum_{i=1}^{N} \log g(W_i \bar{x} + t_i)$$


Newton’s iterative optimization algorithm for maximizing f(x) with respect to the estimated 
variable x is [13] 


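$$x(k) = x(k-1) - \left. \left[ \frac{\partial^2 f(x)}{\partial x^2} \right]^{-1} \frac{\partial f(x)}{\partial x} \right|_{x = x(k-1)}$$

In our case

$$\frac{\partial f(x)}{\partial x} = \sum_{i=1}^{N} \frac{\partial}{\partial x} \log g(W_i \bar{x} + t_i)$$

and

$$\frac{\partial^2 f(x)}{\partial x^2} = \sum_{i=1}^{N} \frac{\partial^2}{\partial x^2} \log g(W_i \bar{x} + t_i)$$

We have

$$g(W_i \bar{x} + t_i) = \frac{0.5}{\cosh^2(W_i \bar{x} + t_i)}$$

It follows that

$$\frac{\partial}{\partial x} \log g(W_i \bar{x} + t_i) = -2\, W_i^{(l)} \tanh(W_i \bar{x} + t_i)$$

and

$$\frac{\partial^2}{\partial x^2} \log g(W_i \bar{x} + t_i) = -\left( W_i^{(l)} \right)^2 \frac{2}{\cosh^2(W_i \bar{x} + t_i)}$$

where $W_i^{(l)}$ denotes the last element of $W_i$, the one that multiplies x.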


Hence the iterative algorithm (4) becomes 

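$$x(k) = x(k-1) + \mu(k-1)\, W^{(l)T}(k-1)\, z(k-1) \qquad (13)$$

where

$$\mu(k-1) = -\left[ \sum_{i=1}^{N} \frac{\left( W_i^{(l)}(k-1) \right)^2}{\cosh^2(u_i(k-1))} \right]^{-1}$$

$W^{(l)}(k)$ is the last column of the weights matrix $W(k)$ and $z(k-1)$ is the vector whose i’th
component is

$$z_i(k-1) = \tanh(u_i(k-1))$$

where $u_i(k) = W_i(k)\bar{x}(k) + t_i(k)$.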
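The recursion can be sketched in a few lines for fixed weights. The sketch below applies Newton steps to $f(x) = \sum_i \log g(W_i \bar{x} + t_i)$ with $g(u) = 0.5/\cosh^2(u)$, where $\bar{x}$ is y with x appended. The function name, the fixed iteration count and the hand-picked parameters are assumptions for illustration. Newton steps of this kind can diverge when started far from the maximum, so the example starts close to it.

```python
import math


def newton_estimate(y, W, t, x0=0.0, iterations=20):
    """Maximize f(x) = sum_i log g(W_i . (y, x) + t_i) over the scalar x."""
    w_last = [row[-1] for row in W]  # weights that multiply x
    x = x0
    for _ in range(iterations):
        x_bar = list(y) + [x]
        u = [sum(w * v for w, v in zip(row, x_bar)) + t_i
             for row, t_i in zip(W, t)]
        # f'(x)  = -2 * sum_i w_i * tanh(u_i)
        # f''(x) = -2 * sum_i w_i^2 / cosh(u_i)^2
        slope_term = sum(w * math.tanh(u_i) for w, u_i in zip(w_last, u))
        curvature_term = sum(w * w / math.cosh(u_i) ** 2
                             for w, u_i in zip(w_last, u))
        # Newton step x - f'/f''; the common factor -2 cancels.
        x = x - slope_term / curvature_term
    return x


# Hypothetical parameters: f is symmetric about x = 1 when y = [0.4].
W = [[0.5, 1.0], [-0.5, 1.0]]
t = [-1.0, -1.0]
x_hat = newton_estimate([0.4], W, t, x0=0.5)
```

In this hand-built example the two units contribute mirror-image terms, so the maximizer is $x = 1$ and the iteration settles there within a few steps.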


The problem of predicting the value of a random sequence $x_1, x_2, \ldots$ at instance
n given N previous values is naturally addressed by the proposed method. Simply define
$y = (x_{n-N}, \ldots, x_{n-1})^T$ and $x = x_n$ and apply the algorithm described above.
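The construction of the concatenated vectors from a sequence amounts to a sliding window. The sketch below builds, for every valid n, the vector made of the N previous values followed by the current value. The function name and the short numeric sequence are hypothetical and serve only as an illustration.

```python
def make_training_vectors(sequence, N):
    """Return [x_{n-N}, ..., x_{n-1}, x_n] for every n with N previous values."""
    return [list(sequence[n - N:n + 1]) for n in range(N, len(sequence))]


# Illustrative sequence; each vector holds N = 3 past values plus the target.
sequence = [0.1, 0.4, 0.9, 0.5, 0.2, 0.6]
vectors = make_training_vectors(sequence, N=3)
```

In each vector the first N entries play the role of y and the last entry plays the role of x.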